Course Description: This course will describe some of the algorithmic techniques developed for handling large amounts of data that is often available in limited ways. Topics that will be covered include data stream algorithms, sampling and sketching techniques, and sparsification, with applications to signals, matrices, and graphs. Emphasis will be on the theoretical aspects of the design and analysis of such algorithms.
Textbooks: There is no course text in general. However, we will be referring the following books.
Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis (Michael Mitzenmacher Eli Upfal). Cambridge University Press. [PCMU]
Foundations of Data Science (Avrim Blum, John Hopcroft , Ravindran Kannan). Cambridge University Press. [FDSBHK]
Prerequisites: Mathematical maturity and comfort with undergraduate algorithms and basic probability. However, since this course is directed at senior level undergraduate students and beginning graduate students, and hence will give sufficient background in randomized algorithms. The pace of the course and the topics will be adjusted accordingly.
Lectures Timing and Schedule:
Lecture will be held every
Monday: 10:15 AM to 12:00 PM Bergen Time (13:45 to 3:30 PM India Time)
Wednesday: 12:15 to 14:00 PM Bergen Time (15:45 to 17:30 PM India Time)
Lecture 01: (19th August 2020) Introduction to Randomized Algorithms 1
[PCMU] Chapters 1 and 2, Basics of Discrete Probability
See also notes by Jeff Erickson and book by David Forsyth used in CS 361.
See also notes by Chandra on basic probability.
Lecture 02: (24th August 2020) Introduction to Randomized Algorithms 2
[PCMU] Chapters 1 and 2, Basics of Discrete Probability
See also notes by Jeff Erickson and book by David Forsyth used in CS 361.
See also notes by Chandra on basic probability.
Lecture 03: (26th August 2020) Introduction to Randomized Algorithms 3
[PCMU] Chapter 1--Quick Sort, Resorvoir Sampling, Starting Probabilistic Inequalities
Notes from Fall 2014 [Chandra (UIUC)] introductory lecture (see Section 5 on reservoir sampling)
See also notes by Chandra on basic probability.
Lecture 04: (31st August 2020) Introduction to Randomized Algorithms 4
[PCMU] Markov, Chebyshev and Chernoff (Chapters 3 and 4) and Load Balancing (Chapter 5)
See Sariel Har-Peled's notes on randomized algorithms, in particular Chapter 7 on Chernoff inequality and a cheat sheet
Lecture 05: (2nd September 2020) Introduction to Randomized Algorithms 5
Applications of Inequalities: Approximate Median and High Probability Anlysis of Quicksort. How to generare a number in the range [n] using coin tosses. Introduced Geometric Random Variable and then as an example did Coupon's Collector Problem (Chapter 2, [PCMU])
Lecture 06: (7th September 2020) Intrdoduction to Course: Streaming
Introduction to Streaming Paradigm, Sampling Primitives via random bits, Resorvoir Sampling
Notes from Fall 2014 of Chandra Course (see Section 5 on reservoir sampling)
Lecture 07: (9th September 2020) Morris Counter: Probabilistic counting
Morris Algorithm to count number of events
Lecture 1 from Fall 2014 of Chandra Course
Lecture 1 from Nelson/Indyk course from 2017.
Chapter 4 in Amit Chakrabarti's notes.
Counting large numbers of events in small registers by Morris, CACM 1978
Approximate counting: a detailed analysis by Flajolet.
Lecture 08: (21st September 2020) Introduction to Frequency Moments and Counting Distinct Elements (Idealized)
Introduction of Frequency Moments and an algorithm to count distinct elements using idelaized hash function, Started on Hash Function
Lecture notes from Nelson/Indyk course.
Lecture 2 from Fall 2014 of Chandra Course
Chapters 2 and 3 in Amit Chakrabarti's notes.
Original Flajolet-Martin paper based on hashing
A theoretically optimal algorithm in terms of space
Lecture 09: (23rd September 2020) Limited independence and Hashing
Avrim Blum's notes
Salil Vadhan's book on pseudorandomness (Section 3.5 on pairwise independence)
Mikkel Thorup's article on hashing
Fast hashing with Strong Concentration Bounds, STOC 2020 paper motivated by streaming applications.
Lecture 10: (28th September 2020) Counting Distinct Elements (Non-Idealized)
Lecture 11: (30th September 2020) AMS Sampling for Estimating Frequency moments and F2 Estimation
Lecture 12: (5th October 2020) Linear Sketches and Frequent Items (Misra-Gries Algorithm)
Chapter 1 in Amit Chakrabarti's notes
Lecture 13: (7th October 2020) Count-Min and Count Sketches
Lecture 6 from Fall 2014 of Chandra Course
Lecture 14: (12th October 2020) Applications of Count-Min and Count Sketches: Range and Point queries
Lecture 6 from Fall 2014 of Chandra Course
Lecture 15: (14th October 2020) Applications of Count-Min and Count Sketches: Sparse Recovery and Compressed Sensing
Video
Lecture 6 from Fall 2014 of Chandra Course
Lecture 16: (19th October 2020) Graph Streaming: Algorithms in Insertion-Only Stream
Slides, Annotated Slides
Video
Lecture 17: (21st October 2020) Graph Streaming: Connectivity Sketch in Dynamic Stream and Applicartions
Slides, Annotated Slides
Video
Assignments
Relevant Books, Surveys, Collected Lecture Notes
Course notes of Amit Chakrabarti focusing mostly on streaming algorithms.
Foundations of Data Science, by Avrim Blum, John Hopcroft and Ravi Kannan. Broad technical introduction covering foundations of several topics. University students should have access to the digital version but there is also a free draft version available here.
Mathematical Foundations of Data Analysis, Jeff Phillips.
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman. Emphasis on the practical aspects
Sketching as a Tool for Numerical Linear Algebra, a survey by David Woodruff. Copy for personal use from author's website
Survey on sketching by Graham Cormode
Survey on Graph Streams, Andrew MacGregor.
Lecture Notes on Randomized Linear Algebra by Michael Mahoney. See also an older survey by same author.
A book in preparation on data stream algorithms by Andrew McGregor and Muthu Muthukrishnan
Surveys on core-sets. Core-sets: An updated survey by Dan Feldman. Coresets and Sketches by Jeff Phillips. Another one by Muneanu and Schwiegelshohn. Original one by Agarwal, Har-Peled and Varadarajan (see also Sariel's book).
Lecture notes on Massively Parallel Computing model and algorithms by Mohsen Ghaffari
Sublinear Algorithms is a topic that we will not cover in this course, however it is a related area and there are many resources collected at this website.
Resources on probability and randomized algorithms
Books on probability and computing: Probability and Computing (Mitzenmacher-Upfal), Randomized Algorithms (Motwani-Raghavan), The Probabilistic Method (Alon-Spencer), Concentration of Measure (Dubhashi-Panconesi)
A survey on concentration inequalities by Fan Chung and Linyuan Li.
Probabilistic Tools for the Analysis of Randomized Optimization Heuristics by Benjamin Doerr
Pseudorandomness by Salil Vadhan.
Related courses
Algorithms for Big Data: Chandra (UIUC), Fall 2014
Algorithms for Big Data: Chandra (UIUC), Spring 2019
Data Stream Algorithms: Amit Chakrabarti (Dartmouth), Spring 2020
Algorithms for Big Data: Jelani Nelson (Harvard), Fall 2015. Includes Video Lectures
Algorithms for Massive Data, Alex Andoni (Columbia), Spring 2019.
Algorithms for Big Data, David Woodruff (CMU), Fall 2019
Sketching Algorithms for Big Data: Piotr Indyk (MIT) and Jelani Nelson (Harvard), Fall 2017
Algorithmic Techniques for Big Data: Moses Charikar (Stanford), Winter 2020
Data Stream Algorithms, Andrew MacGregor (UMass Amherst), Spring 2018.
Models of Computation for Massive Data: Jeff Philips (Utah)